12 research outputs found
Toy Models of Superposition
Neural networks often pack many unrelated concepts into a single neuron - a
puzzling phenomenon known as 'polysemanticity', which makes interpretability
much more challenging. This paper provides a toy model where polysemanticity
can be fully understood, arising as a result of models storing additional
sparse features in "superposition." We demonstrate the existence of a phase
change, a surprising connection to the geometry of uniform polytopes, and
evidence of a link to adversarial examples. We also discuss potential
implications for mechanistic interpretability.
Also available at https://transformer-circuits.pub/2022/toy_model/index.htm
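A minimal sketch of the kind of toy model studied here, assuming PyTorch; the sizes, sparsity level, and variable names below are illustrative choices of ours, not the authors' exact configuration:

import torch

n_features, n_hidden, batch = 20, 5, 1024
sparsity = 0.05  # each feature is active with this probability

# Tied-weight toy model: compress n_features into n_hidden < n_features
# dimensions, then reconstruct through a ReLU readout.
W = (0.1 * torch.randn(n_hidden, n_features)).requires_grad_()
b = torch.zeros(n_features, requires_grad=True)
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(5000):
    # Synthetic sparse features: mostly zero, uniform in [0, 1) when active.
    active = (torch.rand(batch, n_features) < sparsity).float()
    x = torch.rand(batch, n_features) * active
    x_hat = torch.relu(x @ W.T @ W + b)
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# With sparse inputs, more than n_hidden features typically end up with
# non-zero embeddings stored as non-orthogonal directions; the off-diagonal
# structure of W^T W is where superposition becomes visible.
print((W.T @ W).detach())

Sweeping the sparsity knob is one way to see the phase change the abstract mentions: with dense features the model tends to dedicate its n_hidden dimensions to a few features, while high sparsity pushes it into superposition.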
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
We describe our early efforts to red team language models in order to
simultaneously discover, measure, and attempt to reduce their potentially
harmful outputs. We make three main contributions. First, we investigate
scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B
parameters) and 4 model types: a plain language model (LM); an LM prompted to
be helpful, honest, and harmless; an LM with rejection sampling; and a model
trained to be helpful and harmless using reinforcement learning from human
feedback (RLHF). We find that the RLHF models are increasingly difficult to red
team as they scale, and we find a flat trend with scale for the other model
types. Second, we release our dataset of 38,961 red team attacks for others to
analyze and learn from. We provide our own analysis of the data and find a
variety of harmful outputs, which range from offensive language to more subtly
harmful non-violent unethical outputs. Third, we exhaustively describe our
instructions, processes, statistical methodologies, and uncertainty about red
teaming. We hope that this transparency accelerates our ability to work
together as a community in order to develop shared norms, practices, and
technical standards for how to red team language models.
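As a concrete reading of the "LM with rejection sampling" variant above, a hedged sketch in Python; generate and harmlessness_score are hypothetical stand-ins for a sampler and a preference model, not interfaces from the paper:

from typing import Callable

def rejection_sample(prompt: str,
                     generate: Callable[[str], str],
                     harmlessness_score: Callable[[str, str], float],
                     k: int = 16) -> str:
    """Draw k completions and keep the one judged most harmless."""
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=lambda c: harmlessness_score(prompt, c))

Under this scheme a red team attack succeeds only if all k samples are harmful or the preference model misranks them, which gives one intuition for why such models can be harder to red team.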
Language Models (Mostly) Know What They Know
We study whether language models can evaluate the validity of their own
claims and predict which questions they will be able to answer correctly. We
first show that larger models are well-calibrated on diverse multiple choice
and true/false questions when they are provided in the right format. Thus we
can approach self-evaluation on open-ended sampling tasks by asking models to
first propose answers, and then to evaluate the probability "P(True)" that
their answers are correct. We find encouraging performance, calibration, and
scaling for P(True) on a diverse array of tasks. Performance at self-evaluation
further improves when we allow models to consider many of their own samples
before predicting the validity of one specific possibility. Next, we
investigate whether models can be trained to predict "P(IK)", the probability
that "I know" the answer to a question, without reference to any particular
proposed answer. Models perform well at predicting P(IK) and partially
generalize across tasks, though they struggle with calibration of P(IK) on new
tasks. The predicted P(IK) probabilities also increase appropriately in the
presence of relevant source materials in the context, and in the presence of
hints towards the solution of mathematical word problems. We hope these
observations lay the groundwork for training more honest models, and for
investigating how honesty generalizes to cases where models are trained on
objectives other than the imitation of human writing.
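A sketch of the P(True) self-evaluation recipe described above; the prompt mirrors the paper's general True/False format, while logprob_of_continuation is a hypothetical model API introduced here for illustration:

import math

def p_true(model, question: str, proposed_answer: str) -> float:
    """Ask the model to grade its own proposed answer as True or False."""
    prompt = (
        f"Question: {question}\n"
        f"Proposed Answer: {proposed_answer}\n"
        "Is the proposed answer:\n"
        " (A) True\n"
        " (B) False\n"
        "The proposed answer is:"
    )
    # Compare the log-probabilities the model assigns to the two options.
    lp_true = model.logprob_of_continuation(prompt, " (A)")
    lp_false = model.logprob_of_continuation(prompt, " (B)")
    return math.exp(lp_true) / (math.exp(lp_true) + math.exp(lp_false))

The "consider many of their own samples" variant would add other sampled answers to the prompt before asking about one specific possibility.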
Specific versus General Principles for Constitutional AI
Human feedback can prevent overtly harmful utterances in conversational
models, but may not automatically mitigate subtle problematic behaviors such as
a stated desire for self-preservation or power. Constitutional AI offers an
alternative, replacing human feedback with feedback from AI models conditioned
only on a list of written principles. We find this approach effectively
prevents the expression of such behaviors. The success of simple principles
motivates us to ask: can models learn general ethical behaviors from only a
single written principle? To test this, we run experiments using a principle
roughly stated as "do what's best for humanity". We find that the largest
dialogue models can generalize from this short constitution, resulting in
harmless assistants with no stated interest in specific motivations like power.
A general principle may thus partially avoid the need for a long list of
constitutions targeting potentially harmful behaviors. However, more detailed
constitutions still improve fine-grained control over specific types of harms.
This suggests both general and specific principles have value for steering AI
safely.
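A minimal sketch of the AI-feedback comparison underlying Constitutional AI, using the single general principle discussed above; choose_option is a hypothetical preference-model call, not an interface from the paper:

PRINCIPLE = "Do what's best for humanity."

def ai_preference(feedback_model, conversation: str,
                  response_a: str, response_b: str) -> str:
    """Ask a feedback model which response better follows the principle."""
    comparison = (
        f"Consider the following conversation:\n{conversation}\n\n"
        f"Principle: {PRINCIPLE}\n\n"
        "Which response better follows the principle?\n"
        f" (A) {response_a}\n"
        f" (B) {response_b}\n"
        "The better response is:"
    )
    # The returned "A"/"B" labels become preference data for RLAIF training.
    return feedback_model.choose_option(comparison, options=["A", "B"])

Swapping PRINCIPLE for a long list of targeted principles is how the fine-grained control described above would be recovered.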
Scaling Laws and Interpretability of Learning from Repeated Data
Recent large language models have been trained on vast datasets, but also
often on repeated data, either intentionally for the purpose of upweighting
higher quality data, or unintentionally because data deduplication is not
perfect and the model is exposed to repeated data at the sentence, paragraph,
or document level. Some works have reported substantial negative performance
effects of this repeated data. In this paper we attempt to study repeated data
systematically and to understand its effects mechanistically. To do this, we
train a family of models where most of the data is unique but a small fraction
of it is repeated many times. We find a strong double descent phenomenon, in
which repeated data can cause test loss to increase midway through training. A
predictable range of repetition frequencies leads to surprisingly severe
degradation in performance. For instance, an 800M parameter model can be
degraded to the performance of a 2x smaller model (400M params) by repeating
0.1% of the data 100 times, even though the other 90% of the training tokens
remain unique (the repeated 0.1%, seen 100 times, accounts for the remaining
~10% of tokens). We suspect there is a range in the middle where the data can
be memorized and doing so consumes a large fraction of the model's capacity,
and this may be where the peak of degradation occurs. Finally, we connect these
observations to recent mechanistic interpretability work - attempting to
reverse engineer the detailed computations performed by the model - by showing
that data repetition disproportionately damages copying and internal structures
associated with generalization, such as induction heads, providing a possible
mechanism for the shift from generalization to memorization. Taken together,
these results provide a hypothesis for why repeating a relatively small
fraction of data in large language models could lead to disproportionately
large harms to performance.
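A small sketch of how such a mixture can be constructed, together with the arithmetic behind the 800M example above; the names and generator interface are our own illustration, not the paper's setup:

import random

# Arithmetic for the example: repeating 0.1% of the data 100 times fills
# about 10% of the training budget, leaving the other 90% unique.
repeated_subset, repeat_count = 0.001, 100
repeated_tokens = repeated_subset * repeat_count   # 0.10
print(f"{repeated_tokens:.0%} repeated, {1 - repeated_tokens:.0%} unique")

def mix(unique_docs, repeated_pool, p_repeated=0.1):
    """Yield a stream in which a tiny fixed pool makes up ~p_repeated of it."""
    it = iter(unique_docs)
    while True:
        if random.random() < p_repeated:
            yield random.choice(repeated_pool)  # one of the few repeated docs
        else:
            try:
                yield next(it)                  # fresh, never-repeated data
            except StopIteration:
                return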
Overexpression of arginase alters circulating and tissue amino acids and guanidino compounds and affects neuromotor behavior in mice
Arginine is an intermediate of the ornithine cycle and serves as a precursor for the synthesis of nitric oxide, creatine, agmatine and proteins. It is considered a conditionally essential amino acid because endogenous synthesis only barely meets daily requirements. In rapidly growing suckling neonates, endogenous arginine biosynthesis is crucial to compensate for the insufficient supply of arginine via the milk. Evidence is accumulating that the intestine, rather than the kidney, plays a major role in arginine synthesis during this period. Accordingly, ectopic expression of hepatic arginase in murine enterocytes by genetic modification induces a selective arginine deficiency. The ensuing phenotype, whose severity correlates with the level of transgene expression in the enterocytes, could be reversed with arginine supplementation. We analyzed the effect of arginine deficiency on guanidino compound metabolism and neuromotor behavior. Arginine-deficient transgenic mice continued to suffer from an arginine deficiency after the arginine biosynthetic enzymes had disappeared from the enterocytes. Postweaning catch-up growth in arginine-deficient mice was characterized by increased levels of all measured amino acids except arginine. Furthermore, total plasma amino acid concentration, including arginine, was significantly lower in adult male than in adult female transgenic mice. Decreases in plasma and tissue arginine concentrations led to significant decreases in most metabolites of arginine. However, the accumulation of the toxic guanidino compounds guanidinosuccinic acid and methylguanidine varied inversely with circulating arginine concentration, possibly reflecting higher oxidative stress under hypoargininemic conditions. In addition, hypoargininemia was associated with disturbed neuromotor behavior, although brain levels of toxic guanidino compounds and ammonia were normal.